Deciphering “the language of nature”: A transformer-based language model for deleterious mutations in proteins

Various machine-learning models, including deep neural network models, have already been developed to predict deleteriousness of missense (non-synonymous) mutations. Potential improvements to the current state of the art, however, may still benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches. Recent advances in the field of natural language processing show that transformer models—a type of deep neural network—to be particularly powerful at modeling sequence information with context dependence. In this study, we introduce MutFormer, a transformer-based model for the prediction of deleterious missense mutations, which uses reference and mutated protein sequences from the human genome as the primary features. MutFormer takes advantage of a combination of self-attention layers and convolutional layers to learn both long-range and short-range dependencies between amino acid mutations in a protein sequence. We first pre-trained MutFormer on reference protein sequences and mutated protein sequences resulting from common genetic variants observed in human populations. We next examined different fine-tuning methods to successfully apply the model to deleteriousness prediction of missense mutations. Finally, we evaluated MutFormer’s performance on multiple testing datasets. We found that MutFormer showed similar or improved performance over a variety of existing tools, including those that used conventional machine-learning approaches. In conclusion, MutFormer considers sequence features that are not explored in previous studies and can complement existing computational predictions or empirically generated functional scores to improve our understanding of disease variants.


INTRODUCTION
Whole-exome and whole-genome sequencing technologies are powerful tools for the detection of genetic mutations. A typical human genome has 4.1 million to 5.0 million variants when compared with the reference genome sequence, 1 while the average exome captures genomic regions that account for 1%-2% of the human genome. Therefore, distinguishing or prioritizing a small number of disease-related variants from such a large number of background variants becomes a key challenge in understanding genome and exome sequencing data. In particular, the interpretation of non-synonymous single nucleotide variants (SNVs) is of major interest, because missense mutations in proteins account for more than one-half of the current known variants responsible for human-inherited disorders, especially Mendelian diseases, where the mutations have high penetrance. 2 Unlike frameshift indels or splicing mutations in canonical splice sites that have a high likelihood to alter protein function, missense mutations change only a single amino acid, so most of them may not have significant impacts on protein function. To this end, population-specific allele frequencies, such as those inferred from the ExAC 3,4 and gnomAD 5 databases, can be useful to filter out common missense variants that are likely to be neutral, and mutation databases such as ClinVar [6][7][8] and the Human Gene Mutation Database (HGMD) 2 can be valuable resources to find previously reported mutations that may be deleterious. Still, a large number of missense variants from exome sequencing are not yet documented; therefore, the functional interpretation of such variants remains a crucial task.
Numerous computational tools have been developed to predict the deleteriousness or pathogenicity of missense mutations [9][10][11][12][13][14][15][16] ; however, as shown by multiple recent publications, the accuracy of predictive algorithms still has room for improvement. Databases such as dbNSFP [9][10][11] have now documented these whole-exome prediction scores for different prediction algorithms in an effort to facilitate the development of improved functional assessment algorithms. However, depending on the evaluation datasets that were used, most algorithms for missense variant prediction are 65%-80% accurate when examining known disease variants, and only approximately 43.4% of pairwise prediction correlations between different predictive algorithms are greater than 0.5. 9 Many conflicting predictions can be made between different algorithms, which motivated the development of several ensemble-based scoring systems that combine multiple prediction algorithms, such as MetaSVM, 17 REVEL, 18 and CADD. 19 In fact, predictions combined from different algorithms are considered as a single piece of evidence according to the American College of Medical Genetics and Genomics-Association for Molecular Pathology 2015 guidelines. 20 In addition, most existing computational algorithms are based on similar or related information (e.g., evolutionary conservation scores, mutation tolerance scores); potential improvements to the current state of the art could benefit from a fresh look at the biological problem using more sophisticated self-adaptive machine-learning approaches that examine additional types of information.
In other previously published prediction algorithms, deep learning-based sequence-focused models have been demonstrated as effective in modeling variant function. These existing methods primarily used convolutional neural networks (CNNs) to model sequences. [21][22][23] However, recently, advances in deep learning have shown transformer models to be particularly powerful for modeling sequential data. Transformer models, such as the Bidirectional Encoder Representations from Transformers (BERT), 24,25 rely on its central mechanism, selfattention. The use of self-attention allows the transformer model to achieve an unprecedented ability to model relationships between tokens in a sequence, which is crucial in the comprehension of linear sequences. In the past three years, transformers have achieved state-of-the-art performances on a broad range of natural language processing (NLP) tasks, 24,[26][27][28] and transformers are competitive with more traditional CNN-based models on image recognition tasks. 29 As of late, transformers have also been successfully applied for modeling protein structure in Alphafold2, 30 and in works such as Enformer, 31 which used transformers for DNA interpretation. Part of the reasons for the successes of transformers may be caused by their increased ability to handle subtle context dependency through a multi-head attention mechanism, and the ability to compute attentions in parallel to greatly speed up computation over typical recurrent neural networkbased algorithms.
In biological contexts, each amino acid in a given protein sequence exerts its function in a context-dependent manner, including both local dependency (such as forming a short signal peptide that was recognized by cellular machinery) and long-distance dependency (such as being close to another amino acid in threedimensional structure to form a binding site for ligands). Therefore, we hypothesize that transformer models would be capable of more effective modeling of protein sequences, somewhat similar to how transformers have transformed the field of NLP and language translation over the past few years.
In this study, we propose MutFormer, a transformer-based model, to assess deleteriousness of missense mutations. MutFormer is an adaption of the BERT architecture 24 to protein contexts, with appropriate modifications to incorporate protein-specific characteristics. MutFormer can analyze protein sequences directly, with or without any homology information or additional data. Our experiments show that MutFormer is capable of matching or outperforming current methods in the deleteriousness prediction of missense variants.
MutFormer is based on the BERT architecture. 24 A central component of the classical BERT model is its bidirectional self-attention. This mechanism uses a two-dimensional matrix to model the context between all positions in a given sequence, enabling efficient learning of long-range dependencies between residues. In contrast, convolution is another mechanism capable of learning dependencies, which is better suited for short-range dependencies: convolutions are more capable of prioritizing localized patterns via filters, while the repeated application of convolution filters, which are required for the relating of farther residues in a sequence, often weakens long-range dependencies. MutFormer takes advantage of both self-attention layers and convolutional layers to effectively learn both long-range and short-range dependencies.
In language modeling tasks, words or sub-words are short-range features of the sequence. The original BERT uses a fixed WordPiece vocabulary, which contains common words or sub-words in the training corpus. 32 This vocabulary cannot be tuned during the pre-training and fine-tuning process; therefore, a spelling error may introduce an out-of-vocabulary word that will hinder the model's interpretation ability of a given sequence. In protein sequences, "words" correspond with key subsequences or patterns of amino acids. These words can often be changed because of mutations, and furthermore, are unknown. Recent studies showed that vocabulary-free models (e.g., byte-level models) are more robust to noise and perform better on tasks that are sensitive to spelling. 33 Therefore, instead of using a fixed vocabulary, convolutional layers placed in between the embedding layers and the transformer body are used by MutFormer. Mut-Former uses these convolutions to learn its own vocabulary over the course of the training process, incorporating nonlinear patterns via the convolution filters. The weights of the convolutional layers are tuned during both the pre-training and fine-tuning processes.

MATERIALS AND METHODS Pre-processing of input sequence
The input of the MutFormer model is an amino acid sequence that can be either a single protein (with a missense mutation) or the concatenation of a pair of proteins (a mutated protein and its corresponding reference protein). Each amino acid is considered as a token. The maximum input length of MutFormer was set to 1,024, where protein sequences longer than 1,024 need to be cut into segments (see Supplemental Methods 1.1 for details). In the preprocessing step, some special tokens were added to the sequence ( Figure 1A). As the input sequence may be a cropped sequence, we add a "B" token to the real start position of the protein and a "J" token to the real end of the protein so that a cropped start/end (without B or J) can be distinguished from a real start/end (with B or J). B and J letters were chosen because they are not included in the current amino acid code table. In the original BERT model, the first token of every sequence is always a [CLS] token and the final hidden state corresponding with this token is used as the aggregate sequence representation for classification tasks; a [SEP] token is used to separate different sentences and is also placed at the very end of the input. 24 We followed this practice in MutFormer ( Figure 1A).

The MutFormer model
The MutFormer architecture was implemented on top of the classic BERT architecture. The MutFormer model consists of three primary parts: embeddings, convolutions, and the transformer body ( Figure 1B).
The embedding layers create positional, label, and token embeddings for the input sequence. We denote the input sequence length as S, and hidden embedding size as H. For the convolutions of MutFormer, two different integrations into the model were tested: (1) four convolution layers, where the convolution outputs are the only input to the following transformer body, and (2) an integrated approach with skip connections in which the original embedding outputs and convolution outputs are combined and fed into the transformer body (the input of the transformer body is a sum of the embedding output and the result of applying two or four convolutions). Each convolution used a kernel size of 3, and no pooling or concatenation was applied (only the convolution operation itself was necessary to take advantage of convolutions' pattern recognition ability).
The transformer body of MutFormer is taken from the original BERT model, along with its self-attention modules. The time and space complexity of self-attention is quadratic in the length of input. 34 Because of the constraint of computational resources, a maximum input sequence length needs to be set. Sequences exceeding the maximum sequence length are trimmed. Regardless of the trim position, the position embedding corresponding to first position was always assigned to the first residue in the trimmed sequence but not the full sequence. For this reason, the position embeddings represent more of a relative position rather than true position within the full protein sequence.

Pre-training of MutFormer on human protein sequences
MutFormer was pre-trained on a database obtained by combining human reference protein sequences (all isoforms) and protein sequences caused by non-synonymous SNVs with more than 1% population frequency in the gnomAD database. 5 For a full description of the pre-training data preparation, see Supplemental Methods 1.1. The original BERT model uses a self-supervised pre-training objective of recovering an original sequence from corrupted (masked) input, from which high-dimensional representations of the sequence are learned. Similar to BERT, the pre-training objective of MutFormer was to predict corrupted amino acid residues from altered sequences. For the corrupted/ masked residue prediction task, for each sequence, a number of residues were randomly selected for corruption. The ones that were selected were corrupted by either (1) replacing them with a [MASK] token or (2) another random amino acid. This was done to encourage the learning of context not only around explicitly masked residues, but on the entire sequence. To ensure enough context was present for the model, a maximum number of 20 amino acids were masked per sequence. This maximum number was chosen through testing; we tested training using a fixed masking percentage of 15%; however, this resulted in non-convergence during pre-training. For this reason, the 20-residue cap was used alongside the 15% maximum, whichever one was lower, to determine the number of residues masked. To facilitate the learning of more dependency knowledge and minimize overfitting, we used dynamic masking throughout the duration of the training: the corrupted residues were changed randomly for each epoch of data trained on. Note that, for MutFormer, the "next sentence prediction" objective used by the original BERT was removed, because protein sequences in aggregate, unlike their natural language counterpart, do not form 'paragraphs' with logical connections between sentences. The pre-training was performed on a single cloud machine instance with one tensor processing unit (TPU) hardware accelerator (TPU v2-8) on the Google Cloud Platform. Depending on the model, pre-training took approximately 100-200 h. We pre-trained MutFormer with three different model sizes (Table 1), as well as a MutFormer model with integrated convolutions. For comparison purposes, we also pretrained two models without convolutions, which were designated MutBERT to indicate the use of the original BERT architecture (Table 1). The hyperparameters used for pretraining as well as training time, for all models, are displayed in Table S1, and the loss and accuracy for the pre-training task are listed in Table S2.

Fine-tuning MutFormer for the prediction of deleterious mutations
MutFormer was fine-tuned on a dataset built from 84K manually annotated pathogenic missense SNVs from the HGMD (version 2016) 2 and SNPs from the gnomAD database 5 with an allele frequency of more than 0.1%. We generated a training and independent validation set from this data. Mutated protein sequences were generated using ANNOVAR, 35 with each sequence containing exactly one mutation. The fine-tuning was performed on a single cloud machine instance with one TPU hardware accelerator (TPU v2-8) on the Google Cloud Platform. While the specific training time varies from model to model, training time corresponded with approximately 200 steps per minute, making the general time range for fine-tuning approximately 1 h. To obtain the best possible results in deleteriousness prediction, we tested three different fine-tuning methods (described below). Per residue classification. The single mutated protein sequence is used as input. The model is tasked with classifying each amino acid in the protein sequence as benign or deleterious. Amino acids that are identical to the reference sequence are labeled as benign, and the mutated residue is labeled as benign if the overall mutation is benign, or deleterious if the overall mutation is deleterious, depending on the true classification of the sequence (loss and metrics for classification are calculated on only the mutation site). This corresponds to the token classification task or named entity recognition task in NLP ( Figure 2A). Single sequence classification. Same as per residue classification, the input is a single mutated protein sequence. The model is tasked with classifying the entire sequence as deleterious or benign (via the [CLS] token). This similar to the sentence classification task (e.g., sentiment analysis) in NLP ( Figure 2B).

ARTICLE
Paired sequence classification. The input is a pair of two sequences: the mutated protein sequence and its corresponding reference sequence. The model classifies the aggregate of the sequences as deleterious or benign through a comparison of the two sequences. This was inspired by the sentence similarity problem (e.g., the MRPC 36 task) in NLP ( Figure 2C).

Exploration of optimal fine-tuning methods
To find the best fine-tuning method, model, and hyperparameters, we performed two different internal comparison tests using our independent validation set. Test 1 compared the MutFormer architecture with the classical BERT architecture, as well as the three different fine-tuning methods, using different hyperparameters. Test 2 compared the use A D C B Figure 1. The MutFormer model architecture (A) Pre-processing procedure of input. The input may be a single protein (with a missense mutation) or a pair of protein sequences (a reference protein and a mutated protein). In the input sequences, the red color indicates the mutated amino acid and the blue color indicates the original amino acid in the reference protein sequence. For the single-sequence case, "B" and "J" tokens are attached to the true start and end of the sequence, respectively. The sequence is then cropped to the maximum sequence length. A [CLS] token and a [SEP] token are then added to the start and end of the whole sequence, respectively. For the paired-sequence case, the B and J tokens are added as normal to both sequences, except now a [CLS] token is placed at the beginning of the first sequence; a [SEP] token is placed between the two; and another [SEP] token is placed at the very end. (B) MutFormer model structure. A system of positional, label, and token embeddings is used to first vectorize the input tokens. Two convolutional operations (four convolution layers (kernel size 3, stride 1, filters 768)) are used to process the embedding representation (for the integrated convolutions model, skip connections are used before both convolution operations). A bidirectional transformer body with self-attention applies a sequence of attention layers to the resulting embedding representation to obtain the output embeddings (MutFormer/MutBERT NL indicates N layers in the transformer body). (C) Pipeline for deleteriousness prediction without external resources. MutFormer output Embeddings are obtained by feeding the input protein sequence (single or paired) to the MutFormer model. A fully connected layer is attached after this to obtain the resulting deleteriousness prediction. (D) Pipeline for deleteriousness prediction with external resources. MutFormer's output embedding and the sequence representation of all external predictions are concatenated, then a fully connected layer is used to obtain the final prediction. Before both tests, some initial testing was done to establish a set of hyperparameter values that worked well with all combinations of methods/models included in the test. For test 1, the initial set of hyperparameters was established based on the three fine-tuning methods (per residue, single sequence, paired sequence) and three different models (MutBERT 8L , MutBERT 10L , and MutFormer 8L ). For test 2, the initial set of hyperparameters was established based on the four models (MutFormer 8L , MutFormer 10L , MutFormer 12L , and MutFormer 8L (with integrated convolutions) ) being tested. A full list of hyperparameters used is detailed in Table 3

Incorporation of external predictions
To create a model capable of the best possible performance in deleteriousness prediction, in addition to using protein sequence analysis, when training our final models (displayed in our final testing) (Figure 4), MutFormer also incorporated prediction values from previous methods published in literature. Computational predictions from previously published methods were given to MutFormer as input in the following way. First, using ANNOVAR, predicted scores for all mutations within a newly generated test set were obtained from the dbNSFPv3 database. 11 These scores were standardized between 1 and 2, and all missing predictions were assigned values of 0. A fully connected dense layer was connected to these inputs, and the output of this dense layer was concatenated with the original model output. Another dense layer after this concatenated result was then connected to the output node to produce the end prediction result ( Figure 1D). This incorporation strategy prevents the model from becoming reliant on external predictions, limiting the weighting of sequence analysis vs. external predictions to about 1:1 in MutFormer's prediction (since the concatenated output of the two sequences of information is of the same length). To confirm this use ratio, upon completing fine-tuning of the various models, we analyzed the weights of both the combining dense layer, as well as the final output layer. When calculating the weighting sum over both layers, we found, for all models, an approximate weighting of 0.5 for MutFormer's sequence analysis, and 0.5 for external predictions. In particular, the fine-tuned model that represents MutFormer in Figures 4, 5, S3, and S6 has a weighting of 0.485 for sequence analysis and 0.515 for external predictions.

Testing MutFormer against existing methods of deleteriousness prediction
To assess the performance of MutFormer against existing methods of deleteriousness prediction, a total of five testing datasets were used. Out of these five datasets, three are non-gene-specific and non-disease-specific datasets, and two are gene-specific mutation datasets (details for each testing dataset used are outlined in Table 2). For each dataset, filtering was performed using the reference sequences to ensure that no bias was present: all mutations that shared reference sequences with any mutation present in the pre-training data were deleted from the testing sets, and all mutants with identical reference sequences present in any of the independent test sets were removed from the fine-tuning training data before model training.
To allow for a more comprehensive evaluation of the performance of MutFormer with different levels of "fit" on a wide range of data (models with a higher fit will perform better on more similar data, but worse on more dissimilar data; models with a lower fit will have the opposite tendencies), different MutFormer models with varying hyperparameters that affected a model's level of fit were trained (an analysis of MutFormer's performance with varying levels of fit is analogous to an analysis of a receiver operator characteristic (ROC)-type curve, where the performances of MutFormer on similar vs. dissimilar data is compared for different fit levels). In this test, the number of freezing layers and batch size were varied, while all other hyperparameters were set to the best ones found during our hyperparameter test 2 (see above). From these results, testing sets 3-5 showed more variation with different fit parameters than sets 1-2 did (results from all test runs are summarized in Figure S2). All models were then tested on all testing datasets, and the overall best-performing model across all testing datasets was used to represent MutFormer in our comparison. Batch sizes of 16, 32, and 64 were tested in conjunction with freezing layer numbers of 0, 5, 6, and 8 (full hyperparameter description in Table 3). Freezing layer numbers are defined as the number of transformer body layers that were frozen, starting from the first layer (for MutFormer 8L with integrated convolutions, our current best-performing model, eight layers is the total number of transformer body layers). For any freezing layer number greater than 0, the embedding layers were frozen as well (through testing on our validation set we found that leaving the embedding layers trainable while freezing the transformer layers significantly decreased performance). Each model was trained for 14k steps, and checkpoint steps 6k, 8k, 11k, 12k, and 14k were evaluated.
For MutFormer with the incorporation of external scores, we tested various models with varying levels of fit, with an initial set of hyperparameters found through testing based on our independent validation set. The full set of hyperparameters for both of these MutFormer models is displayed in Table 3.
When fine-tuning our final models (used in our comparison of MutFormer vs. existing methods), to increase MutFormer's overall generalization ability and limit overfitting, data augmentation was implemented for the fine-tuning training data: for all epochs of data, each datapoint had a 50% chance of being altered. Those that were selected to be altered would be trimmed down to anywhere from 50% length of the original sequence to 100% length of the original sequence. Trimming was done around the mutation site to ensure the mutation site on average stayed in the same location (in the middle) in the sequence before and after trimming. Epochs were also shuffled independently of each other.

Effect of MutFormer's use of convolutions on the pre-training task
During pre-training, we trained models of different sizes for both the MutFormer architecture and MutBERT (MutFormer model without convolutions) architecture (Table 1). The loss and accuracy on the pre-training task are shown in Table S2. According to our results, the accuracy of MutFormer 8L on the pretraining task test set was 54.5% higher than MutBERT 8L , indicating the advantage of the convolutions. In addition, the accuracy of MutFormer 8L was 27.7% higher than that of MutBERT 10L , despite the latter having two more transformer layers and 10M more parameters (the subscript of each model name indicates only the number of transformer layers but does not consider the two convolutional layers). This outperformance despite its smaller size verifies that the improved performance of the model with convolutions was not simply due to the additional number of parameters or additional layers.

Performance of different fine-tuning methods and hyperparameters
As a part of our internal comparison test 1, we fine-tuned MutFormer and MutBERT using three methods: per residue classification, single sequence classification, and paired sequence classification (Figure 2; see Materials and methods for details). The ROC curves and corresponding area under the curve (AUC) for deleteriousness prediction are shown in Figures 3A-3C. Figure 3D shows a summary of the performance comparison of the three methods; paired sequence classification performed best, followed by per residue classification, and the optimal results were achieved by using a maximal input sequence length of 512 (use of the paired sequence method means an aggregate sequence length of 1,024) ( Figure 3F). Upon examination of the performance of different model architectures, as shown in Figure 3E, MutFormer 8L outperformed MutBERT 8L and MutBERT 10L for each fine-tuning method, indicating the advantage of the MutFormer architecture.

MutFormer's use of integrated convolutions
In our internal comparison test 2, two different strategies for implementing the convolutions were tested: classic MutFormer, and MutFormer (with ARTICLE integrated convolutions). Justification for the second implementation strategy is as follows: while the convolution mechanism should in theory be able to create a representation that will enable the model to best interpret the protein sequence, some information that is present in the original raw embedded sequence may be lost in practice through the convolutions. To solve this, the integrated convolutions model, instead of feeding the embeddings through the convolutions linearly, uses skip connections that result in the convolutions acting as an integrated part of the original embedding layers, allowing the transformer model to access both the convolution filtered representation of the sequence as well as the original embedded representation. In our internal comparison test 2, we compared the performance of MutFormer (with integrated convolutions) in paired sequence classification of our independent validation set to that of the other three original MutFormer architecture models ( Figure S1). ROC curves are shown in Figure S1A, and a summary comparison histogram of the four different models tested is shown in Figure S1B. Overall, the margins of difference are small but, based on the results, the performance of the MutFormer model with integrated convolutions is higher than that of the original MutFormer model; even with only eight transformer layers, the integrated convolutions model outperformed MutFormer 12L , which had more layers and generally a better prediction ability than MutFormer 8L .

Comparison with existing variant prediction methods
As paired sequence classification performed best, for the comparison of MutFormer vs. other methods, this fine-tuning method was used. MutFormer's best overall performance was achieved by training our best model, MutFormer 8L (with integrated convolutions) , on a batch size of 32, 0 freezing layers, and access to external predictions (for full hyperparameter descriptions, see Table 3). MutFormer's performance was compared against a variety of existing methods of deleteriousness prediction, including sequence alignment/homology-based scores (SIFT, 13 PolyPhen-2, 14 LRT, 39 MutationTaster, 40,41 MutationAssessor, 42 FATHMM, 43 PROVEAN, 44 phastCons, 45 and SiPhy 46 ), ensembl-based scores (CADD, 16,19 MetaSVM, 17 MetaLR, 17 VEST3, 47 and DANN 48 ), conservation based scores (GERP++, 15 PhyloP, 49 and fitCons 50 ), we well as some other deep learningbased approaches (FATHMM-MKL, 51 VARITY, 52 and MVP 22 ), out of which MVP and VARITY are recently developed methods. In our comparison, existing methods' predictions were processed in the following way (note that this differs from the incorporation strategy of external scores into MutFormer as input): each method's predictions were standardized from 0 to 1, based on prediction values of all possible missense mutations present in the dbNSFPv3 database. 11 Missing predictions were automatically assigned a prediction value of 0. Both non-inverted and inverted prediction identities (1 = deleterious, 0 = benign and 0 = deleterious, 1 = benign) were tested across our fine-tuning A C B Figure 2. Different fine-tuning methods tested in this study (A) Per residue classification. The input is a protein sequence that contains exactly one variant. Each residue (amino acid) is given a label of benign/deleterious. Benign variants and residues that are identical to the reference sequence are labeled as benign. The fine-tuning task is to predict the label of each amino acid. This is similar to token classification problems (e.g., named entity recognition) in NLP. (B) Single sequence classification. The input is a protein sequence that contains exactly one variant with unknown significance. The embedding of the [CLS] token in the last layer is used to predict whether the sequence contains a deleterious variant. This is similar to sentence classification problems (e.g., sentiment analysis) in NLP. (C) Sequence pair classification. The input is a pair of two sequences: a reference protein sequence and a mutated protein sequence (with a benign or deleterious variant in the center). The embedding of the [CLS] token in the last layer is used to predict whether the mutated sequence contains a deleterious variant. This is similar to sentence pair classification problems (e.g., sentence similarity) in NLP.

ARTICLE ll
The Innovation 4(5): 100487, September 11, 2023 training data, and inversion of scores was done accordingly for each algorithm being compared. We generated both an ROC curve and, because of the unbalanced nature of some of our datasets, a precision-recall-gain curve for each dataset. For dataset 4, which only included rare benign examples, a threshold for each existing method, chosen by taking the point closest to the upper left corner on a ROC curve based on MutFormer's fine-tuning training data, was used to calculate a method specificity for each method. Upon analyzing the performances of the different testing datasets, we found that the best overall performing MutFormer model outperforms previous methods of deleteriousness prediction in non-gene-specific and non-disease-specific datasets (more similar to MutFormer's fine-tuning training dataset: sets 1-3). On the two gene-specific databases (sets 4-5), which contain data less similar to that of MutFormer's fine-tuning data, MutFormer's performance in comparison with other methods expectedly drops, while still matching the performance of various existing methods. For MutFormer without external predictions, its performance is first among non-MutFormer methods for sets 2 and 3, among the top for set 1, and drops further than MutFormer with incorporated external predictions for sets 4 and 5. ROC curves for all datasets are displayed in Figure 4, and precision-recall-gain curves of the same testing results for datasets 1, 2, 4, and 5 are displayed in Figure 5. A bar plot display of all ROC and PRG AUC values for datasets 1, 2, 4, and 5 is also presented in Figure S6. Only three methods ever have PRG AUC values below 0 (these values were clipped to 0 for the best display quality in the bar plot). All three of these instances occurred for dataset 4: fitCons had a PRG AUC performance of À0.312, phy-loP7way_vertebrate À0.633, and phyloP20way_mammalian À0.293. Because of a large number of existing methods compared, we also used the Delong test of ROC to statistically assess the pairwise probability that the given methods' ROC curves were significantly different. Delong test results for datasets 1, 2, 4, and 5 are displayed in Figure S3.

Auxiliary tests
In addition to the results of the above tests, to further assess various aspects of MutFormer's strengths and weaknesses in various other areas, we have

Precomputed deleteriousness scores for all missense mutations
To facilitate future use by other studies, we precomputed deleteriousness scores for all missense mutations using the best-performing MutFormer model. The inference was done on a cloud TPU device (v2-8), which took approximately 11.5 h. These scores can be directly used in the ANNOVAR software to annotate missense variants identified from genome or exome sequencing, and are organized in a flat-file format, allowing for easy use in other functional annotation software tools to complement the dbNSFP database, which has a variety of other prediction scores for missense mutations in the human genome.

DISCUSSION
In the current study, we present MutFormer, a transformer-based machinelearning model to predict the deleteriousness of non-synonymous SNVs using protein sequence as the primary feature. We pre-trained MutFormer on reference protein sequences and alternative protein sequences resulting from common ge-netic variants in the human genome and tested different fine-tuning methods for deleteriousness prediction. During our evaluation processes, MutFormer outperformed multiple commonly used methods and had comparable performances with other methods even when tested on datasets that were less similar to MutFormer's training data (gene-specific data, sets 5 and 6). Below we discuss several advantages and limitations of the MutFormer method and its computational package.
Although a large number of computational tools have been developed over the years on predicting the deleteriousness of non-synonymous mutations, to the best of our knowledge, MutFormer is among the first batch of tools that utilize transformer models to adapt the biological sequence analysis problem as a language analysis problem. A similar model of note is ProtBERT, 53 a previous application of the BERT architecture to protein contexts. Despite both using the transformer architecture, MutFormer differs from ProtBERT in several key aspects: (1) MutFormer makes use of convolutions to learn its own vocabulary, while ProtBERT uses a fixed vocabulary; (2) MutFormer was pre-trained on human protein sequences and common variants, while ProtBERT was trained on reference sequences of all species with sequence information; and (3) MutFormer's primary goal was deleteriousness prediction while ProtBERT focused on subcellular localization of proteins and secondary structure prediction. ConvBERT 54 is another method that like MutFormer, uses a combination of CNNs and the transformer architecture. However, MutFormer and ConvBERT differ both in their architectural incorporation of convolutions as well as their motivation for using convolutional layers. The primary motivation for the use of convolutions in Mutformer is that the tokenization of protein sequences into individual residues does not necessarily represent the breaking of a given sequence into individual residues (the true residues of proteins are more accurately amino acid motifs within the protein's 3D structure). For MutFormer, the convolutions are placed as part of the embedding layers to create more useful and efficient representations of a given sequence. In contrast, regular NLP tasks, such as those which ConvBERT attempts to address, do not have such problems (words are generally more or less residues in natural language). Instead, ConvBERT's architecture uses convolutions within the attention module, with the motivation being the use of convolutions as a way to efficiently capture short-range dependencies.
In addition, the training process of MutFormer is simple and straightforward. MutFormer uses a self-supervised pre-training strategy and therefore does not require any labeled data. For this reason, a large model with hundreds of millions of parameters can be trained on a large amount of non-curated data. On this note, while the current study focused on the human genome exclusively, it is conceivable to include other well-annotated genomes from other species in future studies to see whether increased complexity in the sequence space during pre-training can further improve performance. In the fine-tuning stage, MutFormer learns the deleteriousness of mutations based on the labeled training data as well as its understanding of protein sequence already learned in the pre-training stage, allowing a small amount of fine-tuning data to be effectively used to achieve an accurate result. Furthermore, transformers consider attention, which is not only useful for understanding context in language processing problems, but could also give important insights into the deleteriousness effects that amino acids can have under different sequence contexts. Even though direct feature attribution is not possible with a model containing millions of intertwined model parameters such as MutFormer, our analyses of MutFormer's model weights show that MutFormer is able to prioritize and model these relationships through the use of convolutions and attention.
There are also several limitations of the current study. First, the training data and testing datasets are still of limited size, and testing on large-scale experimentally or clinically supported datasets would result in a more effective evaluation of usability. In the future, MutFormer can be evaluated on large-scale genome sequencing data followed by manual review, to determine whether it helps to prioritize deleterious variants in clinical sequencing settings. Second, because of computational limitations, we did not fully test all parameters during training. As a result, it is likely that our results are not completely optimized; larger models using longer maximum sequence lengths would also be able to outperform the current MutFormer models (e.g., a 12-layer MutFormer with integrated convolutions should perform noticeably better than the best current MutFormer model with only eight attention layers). Third, in the deleteriousness prediction of missense mutations, it is likely impossible for a given model to obtain all required evidence from sequence data alone, so incorporation of other features, such as a three-dimensional (3D) structure (i.e., analyzing 3D structure to scale attention with 3D distance, or labeling sections as belonging to beta sheets or alpha helices for better prediction of deleteriousness), methylation, clinical phenotypic information (i.e., using this knowledge to prioritize certain genes), and other features that could significantly affect proteins' behavior, could reasonably improve overall understanding and thus performance.
In summary, MutFormer is a novel transformer-based method to predict the functional effects of missense mutations. We hope that MutFormer can bring new insights to the bioinformatics community, by being a language model capable of improving our understanding of the language of proteins. Given that MutFormer used complementary information that other bioinformatics tools developed for deleteriousness prediction, we also envision that they could be combined to reach consensuses on predictions, which may be useful in implementation into current clinical guidelines.

DATA AND CODE AVAILABILITY
The source code to run MutFormer, all six pre-trained models, a reproducible workflow, and the best-performing fine-tuned models are available at the GitHub repository: https://github.com/WGLab/MutFormer.